Improve lua safety, improve lua path resolution, fix various bugs, and add automated tests#318
Improve lua safety, improve lua path resolution, fix various bugs, and add automated tests#318jbms wants to merge 27 commits into
Conversation
If a Lua script registers a callback (e.g. for workspace focus or window mapping) and that callback raises an error, the compositor previously crashed with "unprotected lua error" because it used the unprotected `lua_call` API. This change introduces a `safe_pcall` helper that wraps `lua_pcall`. It is now used for all callback executions. If an error occurs, it is logged as an error, but the compositor continues running.
|
I also added a nix flake and a basic python-based automated test suite to test the fix. However, I can exclude those changes from this PR if you prefer. |
|
First of all, thank you very much for taking the time to do all this. I will be happy to merge the Lua part of this PR. These are my comments: CrashesI understand this PR deals with crashes due to an invalid Lua stack while executing Lua code. I would appreciate if you could provide some example (as a comment in this thread), so I can see how frequent this is. Your python test uses local args, state = ...
local scroll = require("scroll")
local views_table = {}
local function print_views()
for _, view in ipairs(views_table) do
print(scroll.view_get_app_id(view))
end
end
local function on_view_map(view, _)
table.insert(views_table, view)
print_views()
end
scroll.add_callback("view_map", on_view_map, nil)If I close an application, the next time I haven't fixed this problem because even though it is easy to do, it would add quite some runtime overhead to the system, as I would need to verify the validity of pointers for each operation. I may do this in the future if it becomes a problem, but as it is now, scroll can be easily crashed using Lua code, even with this PR's changes. External Dependencies and MaintenanceI thank you for the test suite and Nix flake, but I would rather not include things that I will not be able to maintain or add more external dependencies. Unfortunately, scroll is basically me. The project is not popular, and contributions like this one are rare. People come and go, and there is not a consistent structure of maintainers with assigned areas. There used to be a NixOS flake at some point here or in hyprscroller, I don't remember right now, and at some point the maintainer moved on, so people were filing issue reports I couldn't deal with. Currently, the flake is maintained externally by @Diax170, who has been a "faithful" scroll user for quite some time, and participates a lot in bug reporting and feature requests. I would rather keep it that way, and suggest that if you have improvements to what he is doing, submit PRs to his repository. That way, we will have strong Nix support from people who use Nix, otherwise the responsibility lies on me, and I am an Arch user. Again, thanks a lot for your help! |
fe2647f to
35fb02f
Compare
Containers that exactly touched the right or bottom edge of the output were being incorrectly considered invisible due to using `>=` instead of `>` in the bounds check. For example, on a 1280x720 screen, a single tiling window filling the screen has geometry x=0, y=0, width=1280, height=720. Its bottom edge is at y + height = 720, which is equal to maxy. The check: `y + height >= maxy` evaluated to true, marking the container as invisible. This caused commands like `jump` (which filter for visible containers) to fail to detect any windows when a single window was fullscreen or tiling and filling the screen. Reproduction: 1. Start scroll/sway with a single tiling window open. 2. Run the `jump` command. 3. Observe that jump mode is not entered (no labels appear, input is not captured). Fix this by using strict inequality `>` for the upper bounds check, so containers that touch the edge but do not exceed it are considered visible.
dfeafc6 to
14559da
Compare
|
Thanks for your quick response to my PR. I previously used Niri but started using scroll very recently because of the flexibility provided by the base sway functionality and the lua scripting support that you added. As a disclosure, I have used AI assistance to create these changes but I am a very experienced programmer and have thoroughly reviewed all of the changes myself. Re crashes: The specific ones that I ran into were missing/incorrect function names and invalid format strings. Regarding use-after-free crashes for the container pointers exposed to Lua code, I have now added an additional commit that fixes that problem by providing node ids rather than pointers to the lua code, and adds a hash table to make the lookup efficient (this also makes lookup by id for commands more efficient). Re tests: The tests only add python and pytest (available as python-pytest on arch) as a dependency --- using Python seemed like the most expedient choice, but I could also rework it to a different language or test framework if that would be preferable. I would say that I think for a project like this a test suite is particularly valuable, because there is a very stable API surface provided by the command interface, IPC interface, lua interface, plus normal wayland and X11 interface that can be used for testing in a way that can catch regressions without imposing much maintenance overhead. Given that there is a large amount of complicated behavior, and that you have to deal with merging in changes from sway and wlroots, I think having automated tests would be quite valuable. As an example, in the course of adding tests an additional bug with the visibility criteria used for jump was found, which I also fixed as a separate commit. Of course I am happy to exclude them from this PR if that is what you prefer though. Re nix: I am aware of the existing nix flake, in fact I am using it. It includes some extra logic to properly set up scroll as a wayland session, while I added just the subset needed to define the package and development environment. That makes it easy for nix users to build and test it. I removed it from this PR per your request. However, you might find it to be useful: nix terminology is kind of confusing but there is:
Nix the package manager can be used as an additional package manager on any linux distribution, such as arch linux, and indeed that is how I have used it also. (The only caveat is that a few things, like GL drivers, require some special treatment when using nix packages on non-nixos, in particular for GL you need to add an additional nixGL wrapper). It can be used to create isolated dev environments, or to set up per-user environments independent of the system package manager. It can also be convenient for testing on github actions since then you don't have to worry about separately maintaining github actions scripts for setting up dependencies. |
8128de3 to
d86148b
Compare
|
I added a number of additional bug fixes. |
|
I have nothing against well reviewed AI proposals, because I understand access to Goose, Agent Smith, Kythe or any other more up to date tools can increase productivity quickly, I just fear other people who only have access to Open Source models without support for a good integration of semi-large code bases due to limited context, will start spamming the repo with low quality PRs. I only had one AI generated PR before, but it was well reasoned (like this one), and I merged it. I am trying to review everything, but please, don't make this a hotchpotch. I would prefer to have one PR for the Lua stuff, and leave all those bug fixes for later. The issue with bug fixes is some are scroll problems, but most of the leaks are indirect leaks at exit (I also use libasan). This is inherited from sway, which doesn't do a proper cleanup at exit, but the leaks, as far I have investigated, are not adding memory overhead at runtime, just unfreed memory at exit. I usually try to change as few things as possible (from sway) if they are not really needed, to make merging easier. Leaks when the compositor is exiting are not a real problem for users. Of course, it is more elegant to remove them and have a completely clean exit -and it saves me time when looking at the libasan report-, but on the other hand, I don't really want to diverge the code base too much from sway because then merging their changes becomes much harder. There are two ways to go about this, one is to submit PRs to sway, so they also fix them and then I merge their changes, or, if the bug is really pressing because it affects runtime, and sway maintainers don't respond, we fix the bug here. If the project grows and we decide "we don't need sway any more", this may change, but I don't think the time has arrived yet, as code contributions like this one had never happened. I am going to keep on reading all this, I just wanted to send you a quick heads up. Thank you again for taking the time to do all this! |
|
@jbms I think you kind of messed up the branches lol FYI you usually keep your changes on a separate branch ( Let me know if you need me to clarify :) |
This commit fixes a bug in layout move commands (`layout_move_container_nomode` and `layout_move_container_to_workspace`) where they were calling `container_reap_empty` on the parent container. However, these parent containers were already being reaped/destroyed as part of the normal tree rearrangement (e.g., when the last child is moved out of a split container, or during simplification). The duplicate call resulted in trying to destroy the same container twice. When safety checks (node hash map) are enabled, this double reap triggers a "Node not present in table" abort in `node_map_remove` because the node is removed from the map during the first reap and is not present during the second. Without the node hash map safety checks, the double reap does not cause a crash or trigger ASan. The duplicate call to `container_begin_destroy` is safe because it returns early if `con->node.destroying` is already true, and the duplicate `node_set_dirty` also returns early. The container is only added to the transaction once and thus freed once. We removed the redundant `container_reap_empty` calls. Reproduction instructions: 1. Enable safety checks (node hash map) in `node.c`. 2. Start scroll. 3. Open two windows (they will be tiled vertically by default). 4. Focus the second window. 5. Run the command: `move left nomode`. 6. The compositor will abort/crash with "Node not present in table" in `node_map_remove`.
It doesn't matter what he has in his master branch, because the PR is pointing to the right branch. He may want to have his master contain all the changes, and then create different branches for several different PRs. My comment was suggesting there are too many things in a single PR, so that could be a way to split it. |
|
Also, since it's related, I've just added the dev shell to my flake. If anyone wants to report problems regarding it, open a PR or email me I guess (issues and discussions are disabled for some reason) |
Yes, that's possible but I wanted to inform them just in case they actually got confused... if you know what I mean (sorry if I just created even more confusion >.<) |
This commit introduces a safe lookup mechanism for container/view IDs. Instead of casting arbitrary integers to pointers (which could lead to crashes if invalid or stale IDs are used), we now maintain a hash map (node_table) mapping unique IDs to actual sway_node pointers. The node map is initialized during server startup (`server_init`) and finalized during shutdown in `main` after the root node has been destroyed. All nodes are registered in the map upon creation (`node_init`) and removed upon destruction (`node_fini`). This enables: - O(1) node lookup for `swap container with con_id`. - O(1) node lookup for criteria matching by `con_id`.
This commit fixes a use-after-free and double-reap crash when sending a tiled container to the scratchpad. When the container is tiled, `root_scratchpad_add_container` calls `container_set_floating`, which detaches the container from its tiled parent and reaps the parent if it becomes empty. However, `root_scratchpad_add_container` was then: 1. Accessing the `parent` pointer (which might have been freed during reap), causing a use-after-free. 2. Calling `container_reap_empty` on the parent again, causing a double-reap (abort in `node_map_remove`). 3. Calling `arrange_container(parent)` on the potentially freed parent. We fix this by: 1. Removing the redundant `container_reap_empty(parent)` call, as the parent is already reaped in `container_set_floating` (if it was tiled) or it is NULL (if it was already floating). 2. Using the safe node lookup (`node_by_id`) to verify if the parent still exists before calling `arrange_container(parent)`. If it was destroyed, we fallback to arranging the workspace. Reproduction instructions: 1. Enable safety checks (node hash map) in `node.c`. 2. Start sway. 3. Open two windows in a split container (e.g. open one window, split it, open a second window). 4. Focus the first window and run `move scratchpad`. 5. Focus the second window and run `move scratchpad` (this makes the parent split container empty, triggering the reap). 6. The compositor will crash due to UAF in `arrange_container`, or abort with "Node not present in table" if safety checks are enabled.
When a container is detached from its parent, we must clear it from the parent's pending focused_inactive_child pointer if it was pointing to it. Otherwise, the parent's pending focused_inactive_child remains dangling and can lead to use-after-free when container_get_active_view is called. The transaction model keeps current.focused_inactive_child safe because it is rebuilt from seat focus stack during transaction commits. However, pending.focused_inactive_child is maintained manually and was not being cleared on detach.
|
So this still contains a bunch of changes, but I've reorganized them as follows: The changes relevant to sway (node hash map plus a few bug fixes and the shutdown leak fixes) have been submitted as a separate PR to sway: This PR includes those exact commits as parents, in addition to the scroll-specific changes. I have tried to organize this PR into a set of commits where each commit contains one logical change. Therefore it may be easiest to review the commits individually rather than the overall changes. I am also happy to split this PR into multiple PRs however you like, exclude the tests, etc. The main reason to keep many of the bug fixes together with the node map changes is that the node map change adds stricter validation for node destruction which then causes the compositor to abort more readily in various cases without the bug fixes. As far as my master branch, yes I intentionally had that include all of the changes for my own use. |
|
It would be nicer to put the test suite into sway itself as well, and then just add scroll-specific changes here, but I understand that sway has previously rejected adding automated tests. |
This commit fixes several memory leaks in swaybar and sway compositor on exit. Swaybar fixes: - Call `pango_cairo_font_map_set_default(NULL)`, `cairo_debug_reset_static_data()`, and `FcFini()` on exit to release Pango/Cairo/Fontconfig static caches. - Free the DBus signal match slots and object vtable slots in watcher and host instead of leaving them floating, which pinned the bus and leaked memory. - Free the async property call slots when they complete. - Free the parsed JSON header in `i3bar_handle_readable` when it is invalid or after parsing it. Compositor exit fixes: - Free the pango font description in `free_config`. - Call `cairo_debug_reset_static_data()` and `FcFini()` in `sway/main.c` on exit. - Added fontconfig dependency to sway target. Reproduction (swaybar): 1. Run `swaybar` under LSan/ASan. 2. Feed it status lines with invalid/unsupported i3bar protocol JSON headers, or exit swaybar. 3. LSan will report leaks of parsed JSON headers in `status_line.c` and systemd/dbus match slots in `tray/`. Pango and Cairo static caches also leak. Reproduction (compositor): 1. Run sway under LSan. 2. Exit sway. 3. LSan will report leaks in pango (font description) and fontconfig (static data).
This commit resolves a use-after-free (UAF) crash that can occur during compositor shutdown when there are still active client windows. - Added check in `transaction_destroy` to verify if the node is still valid before accessing it, specifically handling cases where nodes are destroyed during shutdown before pending transactions are finished. - Ensure that nodes with active transaction references (`node->ntxnrefs > 0`) are not immediately freed if they are accessed by pending transactions. - Introduced `transaction_shut_down` to clean up pending/queued transactions and clear dirty flags on shutdown. - Implemented recursive node destruction in `root_destroy` (`container_destroy_recursive`, `workspace_destroy_recursive`, `output_destroy_recursive`) to ensure all nodes are properly marked as destroying and their transaction references are cleared. - Properly destroy input manager and seats on compositor teardown. Reproduction: 1. Build sway with ASan enabled (`-Db_sanitize=address`). 2. Open multiple clients (e.g., two terminals). 3. Exit/kill the compositor. 4. ASan will report a use-after-free (UAF) error. This occurs because during shutdown, nodes (containers/workspaces) are destroyed while they still have pending transaction references (`node->ntxnrefs > 0`). The transaction destruction attempts to access these already-freed nodes. Alternatively, run `pytest tests/test_leak_two_clients.py` which specifically reproduces this UAF on shutdown.
This commit resolves memory leaks of workspaces, containers, and outputs state lists during transaction application and destruction. - Instead of deep copying, we transfer ownership of the lists from the transaction instructions to the node's current state during `transaction_apply` (setting the instruction's list pointers to NULL). - Updated `transaction_destroy` to safely free the private lists associated with transaction instructions (which will only be non-NULL if the transaction was not applied). Reproduction: 1. Build sway with ASan/LSan enabled (`-Db_sanitize=address`). 2. Run sway with a status bar (like swaybar) and some clients. 3. Perform actions that trigger transactions (e.g., open and close windows, change focus). 4. Exit sway. 5. LSan will report memory leaks of `list_t` structures (and their item arrays) allocated in `copy_list` or transaction state initialization, specifically for `workspaces` (in `apply_output_state`), `floating`/`tiling` (in `apply_workspace_state`), and `children` (in `apply_container_state`). Alternatively, run the integration test `pytest tests/test_leak_two_clients.py` under ASan/LSan.
This merges the clean branch containing the following generic sway improvements and bug fixes: 1. tree: introduce node hash map for O(1) lookups and safety 2. Fix scratchpad UAF and double reap 3. Fix use-after-free of focused_inactive_child in container 4. Fix memory leaks and use-after-free at exit (generic)
Passing raw C pointers (light userdata) to Lua scripts is unsafe because if the underlying C object (workspace, container, view) is destroyed, the Lua script will hold a dangling pointer. Using this pointer in subsequent API calls will crash the compositor. We fix this by using node IDs (size_t) to reference nodes in Lua, and looking them up safely using node_by_id(). Reproduction: 1. Register a Lua callback that receives a container pointer. 2. Store this pointer in a global Lua variable. 3. Destroy the container. 4. Call a Lua API function passing the stored pointer. 5. The compositor will crash due to invalid pointer dereference. Alternatively, run the integration test 'pytest tests/test_lua_safety.py'.
This introduces the pytest-based test harness infrastructure, including conftest.py and IPC helpers, along with the C-based Wayland and X11 test clients used to simulate windows in integration tests.
Adds Python integration tests to verify Lua APIs (geometry, outputs, workspaces), safe pointer lookup (use-after-free prevention), and client management (mapping/unmapping clients).
- Use wordexp to expand the script path in the lua command, allowing tilde and environment variable expansion. - Enforce that the path must expand to exactly one file, returning specific errors otherwise. - Resolve relative paths against the directory of the loaded configuration file during config loading (and use initial working directory otherwise). - Update documentation in scroll.5.scd to mention wordexp and relative path support. - Mock HOME in tests to test tilde expansion safely.
This commit introduces a new Lua API function, `scroll.node_get_type`, which allows Lua scripts to retrieve the type of a given node ID. The function maps the node ID to a `sway_node` and returns its type as a string (e.g., "root", "output", "workspace", "container", etc.). If the node ID is invalid or not found, it returns `nil`. This commit also updates: - `scroll.lua` with LSP annotations for the new function. - `sway/scroll.5.scd` man page with documentation for the new function.
This commit adds unit tests for the newly introduced `scroll.node_get_type` Lua API function. It verifies that the function correctly returns the type for workspace and output nodes, and returns `nil` for invalid node IDs.
This commit adds a test that reproduces a use-after-free and double-reap crash when sending a tiled container to the scratchpad when it is the only child of its parent container.
This commit addresses potential use-after-free (UAF) risks in layout and space code: 1. `extract_view` in `sway/tree/layout.c` is refactored to use `container_detach` instead of manual list manipulation. This ensures that parent/workspace pointers are properly cleared on detach, avoiding dangling pointers. 2. `find_and_detach_container` in `sway/tree/space.c` is also refactored to use `container_detach` for tiling containers. 3. `layout_space_container_restore_tiling` and `layout_space_container_restore_floating` in `sway/tree/space.c` are updated to set the container's parent/workspace pointers BEFORE calling `arrange_container`, ensuring that `arrange_container` does not access old, potentially freed parent/workspace pointers. Reproduction: 1. Run scroll with space/layout features. 2. Trigger layout changes that use 'extract_view' (e.g. moving containers). 3. If parent pointers are not cleared, subsequent container arrangements can trigger UAF crashes. Alternatively, run the integration test 'pytest tests/test_space_crash.py'.
This commit adds: 1. `tests/test_space_crash.py` which verifies that space restore does not crash due to UAF during `arrange_container` if the old workspace was destroyed. 2. `tests/test_move_cleanup_crash.py` which verifies that move container cleanup properly handles workspace destruction and does not crash when arranging the old workspace.
This commit adds `tests/test_workspace_split_uaf.py` which reproduces a Use-After-Free (UAF) crash. The crash occurs when a workspace is split, one of the split workspaces is destroyed (because it is empty and inactive during output evacuation), and then the remaining sibling workspace is rearranged (e.g. when it also becomes empty or during workspace switch). The remaining workspace still has a dangling pointer to the destroyed sibling, leading to UAF in `arrange_output` when it tries to check sibling's fullscreen state.
When a split workspace is destroyed, its sibling workspace's split state was not being properly cleaned up. Specifically, the sibling's split.split was remaining set to WORKSPACE_SPLIT_HORIZONTAL or WORKSPACE_SPLIT_VERTICAL, and it still pointed to the destroyed workspace as its sibling. This led to use-after-free when the sibling workspace was later accessed or rearranged, as it tried to access the destroyed sibling workspace. We fix this by resetting the sibling's split state to WORKSPACE_SPLIT_NONE, clearing the sibling pointer, clearing the output box, marking the sibling node as dirty, and arranging the sibling workspace so it can claim the full output area. Reproduction: 1. Split a workspace into two sibling workspaces. 2. Close all windows on one of the sibling workspaces so it gets destroyed. 3. Interact with the remaining sibling workspace (e.g. move focus or windows). 4. The compositor will crash with a use-after-free (UAF) when attempting to access the destroyed sibling workspace. Alternatively, run the integration test 'pytest tests/test_workspace_split_uaf.py'.
Calling transaction_commit_dirty() directly inside workspace_switch_callback_end() can cause use-after-free (UAF) crashes if nodes (like workspaces) are destroyed during the animation cancel/cleanup process, and their destruction is processed while we are still in the middle of handling the workspace switch command. We fix this by deferring the transaction commit to the Wayland event loop idle handler. Reproduction: 1. Enable workspace switch animations. 2. Switch from Workspace 1 (non-empty) to Workspace 2. 3. While the animation is running, kill the client window on Workspace 1 (so Workspace 1 becomes empty and is marked for destruction). 4. Switch back to Workspace 1 (cancelling the animation). 5. The compositor will crash with a use-after-free (UAF) in transaction execution. Alternatively, run the integration test 'pytest tests/test_workspace_animation_uaf.py'.
Factor out common test helpers into conftest.py fixtures - Add wayland_client and wait_for_client_map as pytest fixtures in tests/conftest.py. - Remove duplicate definitions of these helpers from all 10 integration test files. - Update test functions to accept these fixtures as arguments with proper type annotations.
- Add tests/test_space_aba.py to test for potential ABA issues during space restore when a view is destroyed and a new one is created with the same address before loading the space. - Add tests/test_workspace_animation_uaf.py to test for potential use-after-free when a workspace is switched and the old workspace is destroyed during the animation.
- Modify struct sway_space_view to store container_id (size_t) instead of struct sway_view *view. - This avoids potential use-after-free and ABA problems when a view is destroyed and a new view is allocated at the same address before the space is restored. - Update space_save to record container->node.id. - Update space_load and helper functions to look up containers by ID using node_by_id(). If the container (and thus the view) was destroyed, it will safely return NULL and skip restoring it. Reproduction: 1. Run scroll. 2. Open a window and save the workspace/space layout. 3. Close the window. 4. Open a new window that happens to be allocated at the same memory address as the closed window. 5. Restore the space. 6. The compositor will attempt to move the new window into the restored space layout, incorrectly identifying it as the old window (ABA problem), or crash with UAF if the address is not reallocated. Alternatively, run the integration test 'pytest tests/test_space_aba.py'.
This commit contains the scroll-specific parts of the memory leak fixes at exit (spaces cleanup). Reproduction: 1. Run scroll. 2. Create one or more spaces (e.g. using scroll Lua API 'space_create'). 3. Exit scroll. 4. LSan will report memory leaks of 'struct sway_space' objects and the 'root->spaces' list.
This commit adds a test 'tests/test_leak_two_clients.py' that runs the compositor, spawns two dummy wayland clients, and terminates the compositor while clients are still active, verifying that the compositor exits cleanly (exit code 0) under ASan/LSan.
|
Sorry, I had to merge something I had been working on for the last week. Now I will have more time to look at this PR. Thank you. |
|
I would like to have the Lua part of the code as a separate PR. You say the rest of the bugs need to be included because the new node hash table is triggering many other bugs, but I don't necessarily agree with it because of what I explain below. I have been looking at it, and my main concern is you consider a node to be destroyed when So, because in your code you make the assumption that a node doesn't exist when begin destroy happens, then of course it triggers use after free errors, but those errors don't happen in reality, because the node still exists. In fact, in certain places, there are checks for There are reasons why objects don't get freed immediately, and it is all related to the transaction model. A transaction moves objects from one If you want to keep compatibility, you should move the node finalization code where the node gets really destroyed (for container, workspace, and output, the rest are OK), Edit: If your concern is about cleanup at application exit, you can always call |
|
Okay, I'm working on addressing your comments. One comment: I see that you have started cherry picking changes from sway ("sway chase:" commits). That makes it harder to deal with changes that apply to both sway and scroll. Would you consider instead using merge commits, so that scroll's git history properly records which sway changes have been merged in? |
It used to be like that. scroll was a pure fork of sway until a few months back (when the Edit: the merges are done manually, and verifying what applies and what doesn't. |
No description provided.